(57) Abstract: Hashes are short summaries or signatures of data files which can be used to identify the file. Hashing multimedia
content (audio, video, images) is difficult because the hash of original content and processed (e.g. compressed) content may differ
significantly. The disclosed method generates robust hashes for multimedia content, for example, audio clips. The audio clip is
divided (12) into successive (preferably overlapping) frames. For each frame, the frequency spectrum is divided (15) into bands.
A robust property of each band (e.g. energy) is computed (16) and represented (17) by a respective hash bit. An audio clip is thus
represented by a concatenation of binary hash words, one for each frame. To identify a possibly compressed audio signal, a block of
hash words derived therefrom is matched by a computer (20) with a large database (21). Such matching strategies are also disclosed.
In an advantageous embodiment, the extraction process also provides information (19) as to which of the hash bits are the least

Generating and matching hashes of multimedia content

FIELD OF THE INVENTION

The invention relates to a method and arrangement for generating a hash 
signal identifying an information signal. The invention also relates to a method and 
arrangement for matching such a hash signal with hash signals stored in a database. 


BACKGROUND OF THE INVENTION 

Hash functions are generally known in the field of cryptography, where they
are used, inter alia, to identify large amounts of data. For instance, in order to verify correct
reception of a large file, it suffices to send the hash value (also referred to as signature) of
that file. If the returned hash value matches the hash value of the original file, there is almost
complete certainty that the file has been correctly received by the receiving party. The
remaining uncertainty is due to the fact that a collision might occur, i.e. two
different files may have the same hash value. A carefully designed hash function minimizes
the probability of collision.

A particular property of a cryptographic hash is its extreme fragility. Flipping
a single bit in the source data will generally result in a completely different hash value. This
makes cryptographic hashing unsuitable for identifying multimedia content where different
quality versions of the same content should yield the same signature. Signatures of
multimedia content that are to a certain extent invariant to data processing (as long as the
processing retains an acceptable quality of the content) are referred to as robust signatures or,
which is our preferred naming convention, robust hashes. By using a database of robust
hashes and content identifiers, unknown content can be identified, even if it is degraded (e.g.
by compression or AD/DA conversion). Robust hashes capture the perceptually essential
parts of audio-visual content.

Using a robust hash to identify multimedia content is an alternative to using
watermarking technology for the same purpose. There is, however, also a great difference.
Whereas watermarking requires action on original content (viz. watermark embedding)
before being released, with its potential impact on content quality and logistical problems,
robust hashing requires no action before release. The drawback of hashing technology is that
access to a database is needed (e.g. hashing is only viable in a connected context), whereas
watermark detectors can operate locally (for example in non-connected DVD players).

United States Patent 4,677,466 discloses a known method of deriving a
signature from a television signal for broadcast monitoring. In this prior art method, the
signature is derived from a short video or audio sequence after the occurrence of a specified
event such as a blank frame.

OBJECT AND SUMMARY OF THE INVENTION 

It is a general object of the invention to provide a robust hashing technology.
More particularly, it is a first object of the invention to provide a method and arrangement for
extracting a limited number of hashing bits from multimedia content. The hashing bits are
robust, but not in the sense that the probability of bit errors is zero. It is known that non-exact
pattern matching (i.e. searching for the most similar hash value in the database) is
NP-complete. In layman's terms, this means that the best search strategy is an exhaustive search,
which is prohibitive in many applications dealing with large databases. Therefore, a second
object of the invention is to provide a method and arrangement that overcomes this
NP-complete search complexity.

The first object is achieved by dividing the information signal into successive
(preferably overlapping) frames, computing a hash word for each frame, and concatenating
successive hash words to constitute a hash signal (or hash in short). The hash word is
computed by thresholding a scalar property or a vector of properties of the information
signal, for example, the energy of disjoint frequency bands or the mean luminance of image
blocks.

The second object is achieved by selecting a single hash word of an input
block of hash words, searching said hash word in the database, and calculating a difference
between the input block of hash words and a corresponding stored block of hash words.
These steps are repeated for further selected hash words until said difference is lower than a
predetermined threshold.

Further features of the invention are defined in the subclaims. 


BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a schematic diagram of an embodiment of an arrangement for
extracting a hash signal from an audio signal in accordance with the invention.

Fig. 2 is a diagram illustrating the subdivision of an audio signal spectrum into
logarithmically spaced bands.

Fig. 3 is a diagram illustrating hash words extracted from an audio clip.

Fig. 4 is a schematic diagram of an embodiment of an arrangement for
extracting a hash signal from a video signal in accordance with the invention.

Fig. 5 is a diagram illustrating hash words extracted from a video sequence.

Fig. 6 is a flow chart of operations carried out by a computer which is shown
in Fig. 1 in accordance with the invention.

Fig. 7 is a diagram to illustrate the operation of a computer which is shown in
Fig. 1.

Fig. 8 shows a graph of the number of bit errors in hash words forming an
extracted hash block which is shown in Fig. 3.

Fig. 9 shows a graph of the most reliable bit of the hash words of the hash
block which is shown in Fig. 3.

Fig. 10 is a flow chart of operations carried out by the computer which is
shown in Fig. 1 in accordance with a further embodiment of the invention.

DESCRIPTION OF EMBODIMENTS 

Before describing a preferred embodiment, a general description of
considerations underlying this invention will be elucidated.

Two signals (audio, video, image) can differ quite drastically (e.g. by
compression) in a signal theoretical sense, whereas they are perceptually indistinguishable.
Ideally, a hash function mimics the behavior of the human auditory system (HAS) or human
visual system (HVS), i.e. it produces the same hash signal for content that is considered the
same by the HAS/HVS. However, many kinds of processing (compression, noise addition,
echo addition, D/A and A/D conversion, equalization etc.) can be applied to the signal and
there is no algorithm that is able to mimic the HAS/HVS perfectly. A complicating factor is
that even the HAS/HVS varies from person to person as well as in time, and even the notion
of one single HAS/HVS is untenable. Also, the classical definition of a hash does not take
time into account: a robust hash should not only be able to identify the content, but should
also be able to identify time (intervals). For this reason the following definition for a robust
hash is herein used: A robust hash is a function that associates with every basic time-unit of
multimedia content a semi-unique bit-sequence that is continuous with respect to content
similarity as perceived by the HAS/HVS.

In other words, if the HAS/HVS identifies two pieces of audio, video or image
as being very similar, the associated hashes should also be very similar. In particular, the
hashes of original content and compressed content should be similar. Also, if hash words are
computed for overlapping frames, the hash words should be similar, i.e. hashes should have a
low pass character. On the other hand, if two signals really represent different content, the
robust hash should be able to distinguish the two signals (semi-unique). This is similar to the
collision requirement for classical cryptographic hashes. The required robustness of the
hashing function is achieved by deriving the hash function from robust features (properties),
i.e. features that are to a large degree invariant to processing. Robustness can be expressed by
the Bit Error Rate (BER), which is defined as the ratio of the number of erroneous bits and
the total number of bits.
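The BER definition can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a hash block is held as a list of integers (one hash word per frame); the function name and the sample values are hypothetical.

```python
def bit_error_rate(block_a, block_b, bits_per_word=32):
    """Return the fraction of differing bits between two hash blocks.

    Each block is a list of integers, one hash word per frame.
    """
    if len(block_a) != len(block_b) or not block_a:
        raise ValueError("hash blocks must be non-empty and of equal length")
    # XOR marks the differing bits; count them word by word.
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (len(block_a) * bits_per_word)


original = [0x00000001, 0xFFFF0000]
processed = [0x00000000, 0xFFFF0001]
ber = bit_error_rate(original, processed)  # 2 erroneous bits out of 64
```

A BER of 0 means the two blocks are identical; unrelated blocks of random bits would be expected to give a BER near 0.5.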

Robust hashing enables content identification, which is the basis for many
interesting applications. Consider the example of identification of content in a multimedia
database. Suppose one is viewing a scene from a movie and would like to know from which
movie the shot originates. One way of finding out is by comparing the scene to all fragments
of the same size of all movies in the database. Obviously, this is totally infeasible in case of a
large database: even a short video scene is represented by a large amount of bytes and
potentially these have to be compared to the whole database. Thus, for this to work, one
needs to store a large amount of easily accessible data and all these data have to be compared
with the video scene to be identified. Therefore, there is both a storage problem (the
database) as well as a computational problem (matching large amounts of data). Robust
hashing alleviates both problems by reducing the number of bits needed to represent the
video scenes: fewer bits need to be stored and fewer bits need to be used in the comparison.

Robust hashing of audio signals will be described first. The audio signal will
be assumed to be mono audio that has been sampled at a sample frequency of 44.1 kHz
(CD-quality). If the audio is stereo, there are two options: either hash signals are extracted for the
left and the right channel separately, or the left and the right channel are added prior to hash
signal extraction.

Even if we only have a short piece of audio (of the order of seconds), we
would like to determine which song it is. As audio can be seen as an endless stream of
audio samples, it is necessary to subdivide audio signals into time intervals or frames and to
calculate a hash word for every frame.

Very often, when trying to match hashes in a database, it is impossible to
determine the frame boundaries. This synchronization problem is particularly applicable to
audio hashing. This problem is solved by dividing the signal into overlapping frames.
Overlapping also ensures that hash words of contiguous frames have a certain amount of
correlation. In other words, the hashes change slowly over time.

Fig. 1 shows a schematic diagram of an embodiment of an arrangement for
generating an audio hash signal in accordance with the invention. The audio signal is first
downsampled in a downsampler 11 to reduce the complexity of subsequent operations and
restrict the operation to a frequency range of 300-3000 Hz, which is most relevant for the
Human Auditory System.

In a framing circuit 12, the audio signal is divided into frames. The frames are
weighted by a Hanning window having a length of 16384 samples (≈0.4 seconds) and an
overlap factor of 31/32. The overlap is chosen in such a way that a high correlation of the
hash words between subsequent frames is ensured. The spectral representation of every frame
is computed by a Fourier transform circuit 13. In the next block 14, the absolute values
(magnitudes) of the (complex) Fourier coefficients are computed.
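The framing step can be sketched as follows. This is an illustrative Python sketch, not the circuit of Fig. 1: the function names are hypothetical, the symmetric form of the Hann window is an assumption, and the Fourier transform itself is left out. With a frame length of 16384 samples and an overlap factor of 31/32, successive frames are shifted by 16384/32 = 512 samples.

```python
import math


def hop_size(frame_len=16384, overlap=31 / 32):
    """Number of samples by which successive frames are shifted."""
    return max(1, round(frame_len * (1 - overlap)))


def hann_window(length):
    """Symmetric Hann (Hanning) window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def windowed_frames(samples, frame_len=16384, overlap=31 / 32):
    """Yield overlapping, Hann-weighted frames of `samples`."""
    hop = hop_size(frame_len, overlap)
    window = hann_window(frame_len)
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield [s * w for s, w in zip(samples[start:start + frame_len], window)]


# Toy input: 40 samples, frames of 32 samples, overlap 31/32 (hop of 1 sample).
toy_frames = list(windowed_frames([1.0] * 40, frame_len=32))
```

Because each frame shares 31/32 of its samples with its predecessor, the spectra, and hence the hash words, of neighbouring frames change only slowly.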

A band division stage 15 divides the frequency spectrum into a number (e.g.
33) of bands. In Fig. 1, this is schematically shown by selectors 151, each of which selects
the Fourier coefficients of the respective band. In a preferred embodiment of the
arrangement, the bands have a logarithmic spacing, because the HAS also operates on
approximately logarithmic bands. By choosing the bands in this manner, the hash will be less
susceptible to processing changes such as compression and filtering. In the preferred
embodiment, the first band starts at 300 Hz and every band has a bandwidth of one musical
tone (i.e. the bandwidth increases by a factor of 2^(1/12)≈1.06 per band). Fig. 2 shows an
example of a spectrum 201 of a frame and the subdivision thereof into logarithmically spaced
bands 202.
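One way to read the logarithmic spacing is sketched below. The sketch assumes that band m runs from 300·2^(m/12) Hz to 300·2^((m+1)/12) Hz, so that both the edges and the bandwidths grow by the factor 2^(1/12) per band; this interpretation and the function name are assumptions.

```python
def band_edges(first_hz=300.0, n_bands=33, ratio=2 ** (1 / 12)):
    """Return n_bands + 1 band edges; band m spans edges[m]..edges[m + 1]."""
    return [first_hz * ratio ** k for k in range(n_bands + 1)]


edges = band_edges()
# Bandwidth of each band; successive widths differ by the same ratio.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Under these assumptions the 33 bands cover roughly 300 Hz to 2 kHz, and every twelfth edge doubles in frequency.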

Subsequently, for every band a certain (not necessarily scalar) characteristic
property is calculated. Examples of properties are energy, tonality and standard deviation of
the power spectral density. In general, the chosen property can be an arbitrary function of the
Fourier coefficients. Experimentally it has been verified that the energy of every band is a
property that is most robust to many kinds of processing. This energy computation is carried
out in an energy computing stage 16. For each band, it comprises a stage 161 which
computes the sum of the (squared) magnitudes of the Fourier coefficients within that band.
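The energy computation can be sketched as follows, assuming the magnitudes of an N-point transform are available as a list in which coefficient i corresponds to the frequency i·fs/N. The function name, the half-open band intervals and the toy numbers are illustrative assumptions.

```python
def band_energies(magnitudes, sample_rate, frame_len, edges):
    """Sum the squared magnitudes of the coefficients falling in each band.

    magnitudes[i] belongs to frequency i * sample_rate / frame_len;
    band b covers edges[b] <= f < edges[b + 1].
    """
    energies = [0.0] * (len(edges) - 1)
    for i, mag in enumerate(magnitudes):
        freq = i * sample_rate / frame_len
        for b in range(len(edges) - 1):
            if edges[b] <= freq < edges[b + 1]:
                energies[b] += mag * mag
                break
    return energies


# Toy spectrum: coefficients at 0, 1, 2 and 3 Hz, two bands [1, 3) and [3, 4).
toy_energies = band_energies([1.0, 2.0, 3.0, 4.0], sample_rate=8,
                             frame_len=8, edges=[1.0, 3.0, 4.0])
```

Here the first band collects the coefficients at 1 Hz and 2 Hz (4 + 9 = 13) and the second band the coefficient at 3 Hz (16); the coefficient at 0 Hz lies outside all bands and is ignored.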

In order to get a binary hash word for each frame, the robust properties are
subsequently converted into bits. The bits can be assigned by calculating an arbitrary function
of the robust properties of possibly different frames and then comparing it to a threshold
value. The threshold itself might also be a result of another function of the robust property
values.

In the present arrangement, a bit derivation circuit 17 converts the energy
levels of the bands into a binary hash word. In a simple embodiment, the bit derivation stage
generates one bit for each band, for example, a '1' if the energy level is above a threshold and
a '0' if the energy level is below said threshold. The thresholds may vary from band to band.
Alternatively, a band is assigned a hash bit '1' if its energy level is larger than the energy
level of its neighbor, otherwise the hash bit is '0'. The present embodiment uses a further
improved version of the latter alternative. To prevent a major single frequency in the audio
signal from producing identical hash words for successive frames, variations of the amplitude
over time are also taken into account. More particularly, a band is assigned a hash bit '1' if its
energy level is larger than the energy level of its neighbor and if that was also the case in the
previous frame, otherwise the hash bit is '0'. If we denote the energy of a band m of frame n
by EB(n,m) and the m-th bit of the hash word H of frame n by H(n,m), the bit derivation
circuit 17 generates the bits of the hash word in the following manner:

\[
H(n,m) =
\begin{cases}
1 & \text{if } EB(n,m) - EB(n,m+1) - \bigl(EB(n-1,m) - EB(n-1,m+1)\bigr) > 0 \\
0 & \text{if } EB(n,m) - EB(n,m+1) - \bigl(EB(n-1,m) - EB(n-1,m+1)\bigr) \le 0
\end{cases}
\]

To this end, the bit derivation circuit 17 comprises, for each band, a first
subtractor 171, a frame delay 172, a second subtractor 173, and a comparator 174. The 33
energy levels of the spectrum of an audio frame are thus converted into a 32-bit hash word.
The hash words of successive frames are finally stored in a buffer 18, which is accessible by
a computer 20. The computer stores the robust hashes of a large number of original songs in a
database 21.
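The bit derivation can be sketched in a few lines of Python. The sketch follows the formula for H(n,m) literally: one bit per pair of neighbouring bands, computed from the energies of the current and the previous frame. The function name and the bit ordering (band 0 mapped to the most significant bit) are assumptions made for illustration.

```python
def hash_word(energies, prev_energies):
    """Derive a hash word from the band energies of two successive frames.

    With 33 band energies per frame this yields a 32-bit word.
    Assumption: the bit for band 0 ends up in the most significant position.
    """
    word = 0
    for m in range(len(energies) - 1):
        # First subtractor: difference between neighbouring bands.
        # Second subtractor: difference with the same quantity one frame earlier.
        diff = ((energies[m] - energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        word = (word << 1) | (1 if diff > 0 else 0)
    return word


toy_word = hash_word([5.0, 1.0, 4.0], [1.0, 1.0, 1.0])  # bits '1', '0'
```

In the toy call the first band pair gives a positive difference (bit '1') and the second a negative one (bit '0'), so the resulting word is binary 10.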

In a subsequent operation, the same arrangement computes the hash of an
unknown audio clip. Reference numeral 31 in Fig. 3 shows the hash words of 256 successive
overlapping audio frames (≈3 seconds) of the audio clip as stored in the database 21. In the
Figure, each row is a 32-bit hash word, a white pixel represents a '1' bit of the hash word, a
black pixel represents a '0' bit, and time proceeds from top to bottom. Reference numeral 32
shows the hash words extracted from the same audio clip after MP3 compression at 32 kBit/s.
Ideally, the two hash blocks should be identical, but due to the compression some bits are
different. The difference is denoted 33 in Fig. 3.

Robust hashing of image or video signals will now be described. Again, the
robust hashes are derived from specific features of the information signal. The first question
to be asked is in which domain to extract said features which determine the hash word. In
contrast to audio, where the frequency domain optimally represents the perceptual
characteristics, it is less clear which domain to use. For complexity reasons it is preferable to
avoid complex operations, like DCT or DFT transformations. Therefore, features in the
spatio-temporal domain are computed. Moreover, to allow easy feature extraction from most
compressed video streams as well, features are chosen which can be easily computed from
block-based DCT coefficients.

Based on these considerations, the preferred algorithm is based on simple
statistics, like mean and variance, computed over relatively large image regions. The regions
are chosen in a fairly simple way: the image frame is divided into square blocks of 64 by 64
pixels. The features are extracted from the luminance component. This is, however, not a
fundamental choice: the chrominance components may be used as well. As a matter of fact,
the easiest way to increase the number of hash bits is to extract them from the chrominance
components in a similar way as the extraction from the luminance.

Fig. 4 shows a block diagram of an arrangement for generating a hash signal
identifying a video signal in accordance with the invention. The arrangement receives
successive frames of the video signal. Each frame is divided (41) in M+1 blocks. For each of
these blocks, the mean of the luminance values of the pixels is computed (42). The mean
luminance of block k in frame p is denoted F(p,k) for k=0,...,M.

In order to make the hash independent of the global level and scale of the
luminance, the luminance differences between two consecutive blocks are computed (43).
Moreover, in order to reduce the correlation of the hash words in the temporal direction, the
difference of spatial differential mean luminance values in consecutive frames is also
computed (44, 45). In other words, a simple spatio-temporal 2x2 Haar filter is applied to the
mean luminance. The sign of the result constitutes (46) the hash bit H(p,k) for block k in
frame p. In mathematical notation:

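\[
H(p,k) =
\begin{cases}
1 & \text{if } \bigl(F(p,k) - F(p,k-1)\bigr) - \bigl(F(p-1,k) - F(p-1,k-1)\bigr) \ge 0 \\
0 & \text{if } \bigl(F(p,k) - F(p,k-1)\bigr) - \bigl(F(p-1,k) - F(p-1,k-1)\bigr) < 0
\end{cases}
\]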
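The video hash bits can be sketched in the same way as the audio bits. The sketch assumes the mean luminance values F(p,k) of one frame are given as a list indexed by k; the function name and the bit ordering (block 1 mapped to the most significant bit) are illustrative assumptions.

```python
def video_hash_word(means, prev_means):
    """Derive one hash word from the block mean luminances of two frames.

    means[k] is F(p, k) and prev_means[k] is F(p - 1, k) for k = 0..M,
    so M + 1 blocks give M hash bits (block 1 in the most significant bit).
    """
    word = 0
    for k in range(1, len(means)):
        # Spatial difference in the current frame minus the same
        # spatial difference in the previous frame (2x2 Haar filter).
        diff = (means[k] - means[k - 1]) - (prev_means[k] - prev_means[k - 1])
        word = (word << 1) | (1 if diff >= 0 else 0)
    return word


frame = [10.0, 20.0, 15.0]
previous = [10.0, 10.0, 10.0]
toy_word = video_hash_word(frame, previous)  # bits '1', '0'

# Changing the global level (+7) and scale (x2) leaves the bits unchanged.
same_word = video_hash_word([2 * v + 7 for v in frame],
                            [2 * v + 7 for v in previous])
```

Because only the sign of a difference of differences is kept, an offset added to all blocks cancels out and a positive gain does not change the sign, which is the independence of global level and scale referred to above.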

In this example, each frame is divided in 33 blocks (i.e., M=32) of size 64x64.
A complete hash H consists of the bits extracted from 30 consecutive frames. Such a hash
block, consisting of 30 hash words of 32 bits each (960 bits), leads to a sufficiently small false
positive probability, as will be shown below. A typical original hash block is depicted 51 in
Fig. 5, where black and white correspond to '0' and '1', respectively. The corresponding hash
block of the same material scaled horizontally to 94% is denoted by reference numeral 52.
Numeral 53 denotes the difference between the hash blocks 51 and 52. In this case the bit
error rate equals 11.3%. Note how indeed the erroneous bits have a strong correlation in the
temporal (vertical) direction.

The process of matching extracted hash blocks to the hash blocks in a large
database will now be described. This is a non-trivial task since it is well-known that
imperfect matching (remember that the extracted hash words may have bit errors) is
NP-complete. This will be shown by means of the following (audio) example. In a database,
100,000 songs of approximately five minutes (≈25000 hash words per song) are stored. It
will be assumed that a hash block having 256 hash words (e.g. hash block 32 in Fig. 3) has
been extracted from the unknown audio clip. It is now to be determined to which of the
100,000 stored songs the extracted hash block matches best. Hence the position of a hash
block in one of the 100,000 songs has to be found which most resembles the extracted hash
block, i.e. for which the bit error rate (BER) is minimal or, alternatively, for which the BER
is lower than a certain threshold. The threshold directly determines the false positive rate, i.e.
the rate at which songs are incorrectly identified from the database.

Two 3-second audio clips (or two 30-frame video sequences) are declared
similar if the Hamming distance between the two derived hash blocks H1 and H2 is below a
certain threshold T. This threshold T directly determines the false positive rate Pf, i.e. the rate
at which two audio clips / video sequences are incorrectly declared equal (i.e. incorrectly in
the eyes of a human beholder): the smaller T, the smaller the probability Pf will be. On the
other hand, a small value T will negatively affect the false negative probability Pn, i.e. the
probability that two signals are 'equal', but not identified as such. In order to analyze the
choice of this threshold T, we assume that the hash extraction process yields random i.i.d.
(independent and identically distributed) bits. The number of bit errors will then have a
binomial distribution with parameters (n,p), where n equals the number of bits extracted and
p (=0.5) is the probability that a '0' or '1' bit is extracted. Since n (32x256=8192 for audio,
32x30=960 for video) is large in our application, the binomial distribution can be
approximated by a normal distribution with a mean μ=np and standard deviation
σ=√(np(1-p)). Given a hash block H1, the probability that a randomly selected hash block
H2 has less than T=αn errors with respect to H1 is given by:

\[
P_f(\alpha) = \frac{1}{\sqrt{2\pi}} \int_{(1-2\alpha)\sqrt{n}}^{\infty} e^{-x^2/2}\,dx
            = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{1-2\alpha}{\sqrt{2}}\sqrt{n}\right) \qquad (1)
\]
However, in practice the robust hashes have high correlation along the time
axis. This is due to the large time correlation of the underlying video sequence, or the overlap
of audio frames. Experiments show that the number of erroneous bits is normally
distributed, but that the standard deviation is approximately 3 times larger than in the i.i.d.
case. Equation (1) therefore is modified to include this factor 3.

\[
P_f(\alpha) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{1-2\alpha}{3\sqrt{2}}\sqrt{n}\right) \qquad (2)
\]

The threshold for the BER used during experiments was α=0.25. This means
that, of 8192 bits, less than 2048 bit errors have to occur in order to decide that the hash
block originates from the same song. In this case the bit errors have a normal distribution
with a mean μ of np=4096 and a standard deviation σ of 3√(np(1-p))=135.76. The chosen
threshold setting then lies 15.2σ below the mean. Hence, the false
alarm probability equals 1.8·10^-52. Note, however, that the false alarm probability will be
higher in practice if music with similar hash words (e.g. a Mozart piece played by two
different pianists) is included in the database.
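The numerical example can be reproduced approximately with Python's standard `math.erfc`. The sketch below assumes the normal approximation with mean np, p=0.5, and the standard deviation 3√(np(1-p)) used in the example; the function names are hypothetical. Since the sketch does not round the σ multiple, its result is not expected to agree with the quoted figure to the last digit.

```python
import math


def std_dev(n_bits, factor=3.0, p=0.5):
    """Standard deviation of the number of bit errors (normal approximation)."""
    return factor * math.sqrt(n_bits * p * (1 - p))


def false_positive_probability(alpha, n_bits, factor=3.0):
    """Probability that a random hash block has fewer than alpha*n bit errors."""
    mean = 0.5 * n_bits
    # Distance between the mean and the threshold, in standard deviations.
    z = (mean - alpha * n_bits) / std_dev(n_bits, factor)
    return 0.5 * math.erfc(z / math.sqrt(2))


sigma_audio = std_dev(32 * 256)                         # about 135.76
p_audio = false_positive_probability(0.25, 32 * 256)    # vanishingly small
```

For the audio parameters the result is far below 10^-50, which illustrates why a BER threshold of 0.25 gives a negligible false alarm probability under these assumptions.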

Searching the position of the extracted hash block in the database can be done
by brute force matching. This will take around 2.5 billion (=25000x100,000) matches.
Moreover, the number of matches increases linearly with the size of the database.

In accordance with an aspect of the invention, the computer 20 uses a more
efficient strategy for finding the corresponding song in the database 21. Fig. 6 is a flow chart
of operations carried out by the computer. Upon storing an original song in the database, the
computer updates a lookup table (LUT) in a step 60. The LUT is shown as a separate
memory 22 in Fig. 1, but it will be appreciated that it will be part of the large database
memory 21 in practice. As is shown in Fig. 7, the LUT 22 has an entry for each possible
32-bit hash word. Each entry of the LUT points to the song(s) and the position(s) in that song
where the respective hash word occurs. Since a hash word can occur at multiple positions in
multiple songs, the song pointers are stored in a linked list. Thus the LUT can generate
multiple candidate songs. Note that a LUT containing 2^32 entries can be impractical when
there is only a limited number of songs in the database. In such a case, it is advantageous to
implement the LUT with a hash table and a linked list. Reference numeral 70 in Fig. 7
denotes a block of 256 hash words extracted from the unknown audio clip (e.g. hash block 32
in Fig. 3).
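The lookup table can be sketched with a Python dictionary, which is itself a hash table and therefore corresponds to the variant suggested for small databases. An ordinary list of (song, position) pairs stands in for the linked list. The song names and hash words are made-up illustration values.

```python
def build_lut(songs):
    """Map every hash word to the list of (song_id, position) where it occurs.

    `songs` maps a song identifier to its list of hash words.
    """
    lut = {}
    for song_id, words in songs.items():
        for position, word in enumerate(words):
            lut.setdefault(word, []).append((song_id, position))
    return lut


songs = {
    "song 1": [0x00000001, 0x0000A5A5, 0x00000000],
    "song 2": [0x00000000, 0x12345678],
}
lut = build_lut(songs)
candidates = lut.get(0x00000000, [])  # occurs in both songs
```

Only hash words that actually occur are stored, so the memory needed grows with the number of stored hash words instead of with the 2^32 possible values.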

In a first embodiment of the matching method, it will be assumed that every
now and then a single hash word has no bit errors. In a step 61, a single hash word H(m) is
selected from the hash block and sent to the database. Initially, this will be the last hash word
H(256) of the extracted hash block. In the example shown in Fig. 7, this is the hash word
0x00000001. The LUT in the database points to a certain position in song 1. Let it be
assumed that this position is position p. In a step 62, the computer calculates the BER
between the extracted hash block and the block of hash words from position p-255 until
position p of song 1 (denoted 71 in Fig. 7). In a step 63, it is checked whether the BER is low
(<0.25) or high. If the BER is low, there will be a high probability that the extracted hash
words originate from song 1. If the BER is high, either the song is not in the database or the
single hash word H(m) contains an error. The latter will be assumed to be the case in this
example. Another single hash word is then selected in a step 64 and looked up in the LUT. In
Fig. 7, the last but one single hash word H(255) is now being looked up. This hash word
appears to occur in song 2. The BER between input block 70 and stored block 72 appears to
be lower than 0.25 now, so that song 2 is identified as the song from which the audio clip
originates. Note that the last hash word in the stored block 72 is 0x00000000. Apparently, the
previously selected hash word 0x00000001 had one bit error.

The computer thus only looks at one single hash word at a time and assumes
that every now and then such a single hash word has no bit errors. The BER of the extracted
hash block is then compared with the corresponding (on the time axis) hash blocks of the
candidate songs. The title of the candidate song with the lowest BER will be chosen as the
song from which the extracted hash words originate, provided that the lowest BER is below
the threshold (step 65). Otherwise, the database will report that the extracted hash block was
not found. Another single hash word will then be tried. If none of the single hash words leads
to success (step 66), the database will respond by reporting the absence of the candidate song
in the database (step 67).
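The search strategy of Fig. 6 can be sketched as follows. This is a simplified illustration under stated assumptions, not the flow chart itself: the function names, the toy database and the return value (song, start position, BER) are hypothetical, and a Python dictionary plays the role of the LUT.

```python
def block_ber(block_a, block_b, bits_per_word=32):
    """Bit error rate between two equally long blocks of hash words."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (len(block_a) * bits_per_word)


def build_lut(songs):
    """Map each hash word to the (song_id, position) pairs where it occurs."""
    lut = {}
    for song_id, words in songs.items():
        for pos, word in enumerate(words):
            lut.setdefault(word, []).append((song_id, pos))
    return lut


def find_song(block, songs, lut, threshold=0.25):
    """Look up single hash words until a stored block with a low BER is found."""
    n = len(block)
    for i in range(n - 1, -1, -1):          # start with the last hash word
        best = None
        for song_id, pos in lut.get(block[i], []):
            start = pos - i                 # align the block with the song
            if start < 0 or start + n > len(songs[song_id]):
                continue
            rate = block_ber(block, songs[song_id][start:start + n])
            if rate < threshold and (best is None or rate < best[2]):
                best = (song_id, start, rate)
        if best is not None:
            return best
    return None                             # no single hash word led to success


songs = {
    "song1": [1000 + i for i in range(20)],
    "song2": [5000 + i for i in range(20)],
}
lut = build_lut(songs)
query = songs["song2"][3:7]
query[-1] ^= 1 << 31                        # one bit error in the last hash word
result = find_song(query, songs, lut)
```

In the toy run the last hash word contains a bit error and is not found in the table, so the last but one word is tried; it points into "song2", where the aligned block differs from the query in 1 of 128 bits.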

The above-described method relies on the assumption that every now and then
an extracted hash word has no bit errors, i.e. it is perfectly equal to the corresponding stored
hash word. Extensive experiments have shown that this occurs regularly a few times per
second for most audio. This is shown, for example, in Fig. 8 which shows the number of bit
errors in the 256 hash words forming the extracted block of Fig. 3B. Thirteen hash words
occur without any bit errors in this 3-second audio clip.

However, it is unlikely that hash words without any bit errors occur when the
audio is severely processed. In that case, the title of the song cannot be retrieved by means of
the previous method. For such cases, another embodiment of the matching method will be
described. This method uses soft information of the hash extraction algorithm to find the
extracted hash words in the database. Soft information is understood to mean the reliability of
a bit, or the probability that a hash bit has been retrieved correctly. In this embodiment, the
arrangement for extracting the hash words includes a bit reliability determining circuit. The
bit reliability determining circuit is denoted 19 in the audio hash extraction arrangement
which is shown in Fig. 1. The circuit receives the differential energy band levels in the form of
real numbers. If the real number is very close to the threshold (which is zero in this example),
the respective hash bit is unreliable. If instead the number is very far from the threshold, it is
a reliable hash bit. The threshold can be fixed or controlled such that the number of reliable
bits is fixed.
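The reliability measure can be sketched as the distance of each differential energy value to the threshold. The sketch is illustrative only: the function names, the list-based representation and the toy values are assumptions.

```python
def bits_with_reliability(diffs, threshold=0.0):
    """Return the hash bits and a reliability value for each bit.

    `diffs` holds the real-valued differential energy levels; a value
    close to the threshold gives an unreliable bit.
    """
    bits = [1 if d > threshold else 0 for d in diffs]
    reliability = [abs(d - threshold) for d in diffs]
    return bits, reliability


def least_reliable(reliability, count):
    """Indices of the `count` least reliable bits, least reliable first."""
    order = sorted(range(len(reliability)), key=lambda i: reliability[i])
    return order[:count]


bits, reliability = bits_with_reliability([0.9, -0.01, -2.0, 0.3])
weakest = least_reliable(reliability, 2)  # the values -0.01 and 0.3
```

Keeping the number of reliable bits fixed, as mentioned above, then amounts to always selecting the same `count` of least reliable bits per hash word.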

The bit reliability determining circuit 19 determines the reliability of every
hash bit, and thus enables the extraction arrangement or the computer 20 to generate a list of
most probable alternative hash words for each hash word. By assuming again that at least one
of the alternative hash words is correct, the song title can be retrieved correctly and easily.
Fig. 9 shows, for all the 256 hash words of hash block 32 in Fig. 3, which bit of the hash
word is the most reliable.

Fig. 10 is a flow chart of operations carried out by the computer in this
embodiment of the method of finding the extracted hash block in the database. The same
reference numerals are used for operations already described before. Again, the last extracted
hash word (0x00000001, see Fig. 7) of the hash block is initially selected and sent to the
database (step 61). The LUT in the database points to position p in song 1. The BER between
the extracted hash block and the corresponding block 71 in song 1 is calculated (step 62).
Meanwhile, it is known from the previous example that the BER is high. In a step 101, the
computer now consults the bit reliability determining circuit 19 (Fig. 1) and learns that bit 0
is the least reliable bit of this particular hash word. The next most probable candidate hash
word is now obtained by flipping said bit. The new hash word (0x00000000) is sent to the
database in a step 102. As is shown in Fig. 7, the hash word 0x00000000 leads to two
possible candidate songs in the database: song 1 and song 2. If, for example, the extracted
hash words now have a low BER with the hash words of song 2, song 2 will be identified as
the song from which the extracted hash block originates. Otherwise, new hash word
candidates will be generated, or another hash word will be used to try to find the respective
song in the database. This strategy is continued until it is found in a step 103 that there are no
further alternative candidate hash words.

Note that, once a piece of audio is identified in practice as originating firom a 
certain song, the database can first try to match the extracted hash words with that song 
before generating all tiie candidate hash words. 
A very simple way of generating a list of most probable hash words is to
include all the hash words with the N most reliable bits being fixed and every possible
combination for the remaining bits. In the case of 32 bits per hash and choosing N=23, the
remaining 32 - 23 = 9 bits are free, so a list of 2^9 = 512 candidate hash words is required.
Furthermore, it means that the 9 least reliable bits of the hash word may all be wrong and
the audio excerpt can still be identified. For the
case shown in Figure 6, this means that 117 hash words, instead of 13 with the previous
method, will yield a correct pointer to the song in the database.
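
The candidate list described above may be generated along the following lines. This is a minimal sketch under stated assumptions: the reliability information is taken to be one numeric score per bit (higher meaning more reliable), and the function name `candidate_hash_words` and its parameters are hypothetical.

```python
from itertools import product


def candidate_hash_words(word, reliability, n_fixed=23, bits=32):
    """Return all words that agree with `word` on the n_fixed most reliable bits.

    reliability -- one score per bit index, higher means more reliable
                   (an assumed encoding; how scores are obtained is not shown)
    """
    # Bit indices sorted from least to most reliable; the first ones may vary.
    order = sorted(range(bits), key=lambda bit: reliability[bit])
    free_bits = order[:bits - n_fixed]
    candidates = []
    for values in product((0, 1), repeat=len(free_bits)):
        candidate = word
        for bit, value in zip(free_bits, values):
            # Clear the bit, then set it to the enumerated value.
            candidate = (candidate & ~(1 << bit)) | (value << bit)
        candidates.append(candidate)
    return candidates
```

With 32-bit hash words and `n_fixed=23`, the list contains 2^9 = 512 entries, one of which is the extracted hash word itself.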

In an alternative embodiment of the matching method, the matching is done
only on the basis of hash bits being marked as reliable. This method is based on the insight
that it is unnecessary to compare unreliable bits of a received hash with the corresponding
bits in the database. This leads to a far smaller bit error rate, although this comes at the cost
of a more complicated search strategy and a larger bandwidth needed to transmit all
necessary information to the database.
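
A bit error rate restricted to the reliable bits may, for example, be computed as sketched below. The encoding of the reliability information as one bit mask per hash word (a 1 marking a reliable bit) is an assumption made for this illustration, as are the function and parameter names.

```python
def reliable_bit_error_rate(received, masks, stored):
    """BER computed over the bits marked reliable only.

    received -- extracted hash words
    masks    -- per word, a bit mask with 1 for every reliable bit (assumed encoding)
    stored   -- hash words from the database at the candidate position
    """
    errors = 0
    compared = 0
    for word, mask, reference in zip(received, masks, stored):
        # Differences in unreliable positions are masked out before counting.
        errors += bin((word ^ reference) & mask).count("1")
        compared += bin(mask).count("1")
    if compared == 0:
        return 1.0  # no reliable bits at all: treat as no evidence of a match
    return errors / compared
```

The mask has to accompany every hash word sent to the database, which reflects the larger bandwidth mentioned above.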

A few applications of robust hashing will now be described. 
- Broadcast Monitoring: A broadcast monitoring system consists of two parts: a central
database containing the hashes of a large number of songs, and monitoring stations that
extract a hash block from the audio that is broadcast by, for instance, radio stations. The
monitoring station will send the extracted hash block to the central database and then the
database will be able to determine which song has been broadcast.
- Mobile Phone Audio Info: Imagine that you are in a bar and hear a song of which you
want to know the title. You then just pick up your mobile telephone and call an audiohash
database. The audiohash database will then hear the song and extract a hash block. If it
then finds the hash block in the database, it will report back the title of the song.
- Connected Content (MediaBridge): The company Digimarc currently has an application
called MediaBridge, which is based on watermarking technology. The idea is that a
watermark in a piece of multimedia will direct a user to a certain URL on the Internet
where he can get some extra information. E.g. an advertisement in a magazine is
watermarked. By holding this advertisement in front of a webcam, a watermark detector
will extract a watermark key that is sent to a database. This database then contains the
URL to which the user will be redirected. The same application can work with the use of
robust hashing technology. In the future, one might even think of a person pointing his
mobile videophone at a real-life object. The audio hash database will then report back
information about this object, either directly or via a URL on the Internet.
- Multimedia Quality Metering: If the hash words of high quality original content are listed
in the database, a quality measure can be obtained by determining the BER of the
extracted hash words of processed multimedia content.

From an abstract point of view, the robust audio hashes are derived from an
audio signal by comparing energy in different frequency bands and over time. A
generalization of this approach is to consider any cascade of LTI and non-linear functions. In
particular, a robust hash can also be obtained by applying a (dyadic) filter bank (an LTI
operator), followed by squaring or taking absolute values (a non-linear function), followed by
a difference operator over time and/or band (an LTI operator), finally followed by a
thresholding operator. By applying a carefully designed linear filter bank as an initial
operator, the complexity of an FFT can be avoided. Moreover, as many compression engines
have a linear filter bank as an initial phase, there is the option to integrate feature extraction
with compression.
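
The cascade described above (filter bank, squaring, difference over time and band, thresholding) may be illustrated with the following sketch. It is not the filter bank of the disclosure: a plain Haar decomposition is swapped in as the simplest dyadic filter bank, the threshold is taken to be zero, and all names and the choice of four levels are assumptions made for illustration. Each frame must contain at least 2^levels samples.

```python
def haar_band_energies(frame, levels):
    """Dyadic Haar filter bank followed by squaring: one energy per band."""
    energies = []
    low = list(frame)
    for _ in range(levels):
        # Split into a high-pass (difference) and a low-pass (average) half.
        high = [(low[i] - low[i + 1]) / 2 for i in range(0, len(low) - 1, 2)]
        low = [(low[i] + low[i + 1]) / 2 for i in range(0, len(low) - 1, 2)]
        energies.append(sum(x * x for x in high))
    energies.append(sum(x * x for x in low))
    return energies[::-1]  # lowest band first


def hash_words(frames, levels=4):
    """Difference over band and over time, then thresholding at zero."""
    energies = [haar_band_energies(frame, levels) for frame in frames]
    words = []
    for n in range(1, len(energies)):
        word = 0
        for m in range(levels):  # levels + 1 bands yield `levels` bits
            diff_now = energies[n][m] - energies[n][m + 1]
            diff_prev = energies[n - 1][m] - energies[n - 1][m + 1]
            bit = 1 if diff_now - diff_prev > 0 else 0
            word = (word << 1) | bit
        words.append(word)
    return words
```

Because each bit depends only on the sign of an energy difference, scaling the whole signal by a positive constant leaves the hash words of this sketch unchanged.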

It is further noted that robust hashing and digital watermarks can be used in
combination to identify content. The method described above and some watermark detection
algorithms have a number of initial processing steps in common, viz. the computation of the
spectral representation. This leads to the idea that watermark detection and feature extraction
can easily be integrated in one application. Both retrieved watermark and hash words can
then be sent to a central database for further analysis, to allow identification of content.

In summary, the disclosed method generates robust hashes for multimedia
content, for example, audio clips. The audio clip is divided (12) into successive (preferably
overlapping) frames. For each frame, the frequency spectrum is divided (15) into bands. A
robust property of each band (e.g. energy) is computed (16) and represented (17) by a
respective hash bit. An audio clip is thus represented by a concatenation of binary hash
words, one for each frame. To identify a possibly compressed audio signal, a block of hash
words derived therefrom is matched by a computer (20) with a large database (21). Such
matching strategies are also disclosed. In an advantageous embodiment, the extraction
process also provides information (19) as to which of the hash bits are the least reliable.
Flipping these bits considerably improves the speed and performance of the matching
process.
CLAIMS:



1. A method of generating a hash signal identifying an information signal, the
method comprising the steps of:
- dividing the information signal into frames,
- computing a hash word for each frame, and
- concatenating successive hash words to constitute the hash signal.

2. The method of claim 1, wherein said computing step comprises the steps of:
- dividing each frame of the information signal into disjoint bands or blocks;
- calculating a property of the signal in each of said bands or blocks;
- comparing the properties in the bands or blocks with respective thresholds;
- representing the results of said comparisons by respective bits of the hash word.

3. The method as claimed in claim 2, wherein the property of a neighboring band 
or block constitutes said threshold. 

4. The method as claimed in claim 2, wherein the property of a corresponding
band or block in a previous frame constitutes said threshold.

5. The method as claimed in claim 2, wherein the bands or blocks are frequency
bands of the frequency spectrum of the respective frame of the information signal.

6. The method as claimed in claim 5, wherein the frequency bands have an
increasing bandwidth as a function of the frequency.

7. The method as claimed in claim 5, wherein said property is the energy of a
frequency band.

8. The method as claimed in claim 5, wherein said property is the tonality of a
frequency band.




9. The method of claim 1, wherein said information signal is divided into
overlapping frames.

10. The method as claimed in claim 2, wherein the information signal is a video
signal, the frames of which are divided into blocks, the mean luminance of a block
constituting the property of said block.

11. The method of claim 2, further comprising the step of using the inputs of said
comparing steps to generate information which is indicative of the reliability of the bits of the
hash word.

12. A method of generating a hash signal to identify an information signal,
comprising the steps of:
- dividing the information signal into blocks;
- extracting for each block a feature of the information signal within said block;
- comparing the value of the extracted feature with a threshold;
- generating for each block a hash bit indicating whether the value of the extracted feature
is larger or smaller than said threshold;
- determining for each block reliability information indicating whether the value of the
extracted feature differs substantially from said threshold;
- combining said hash bits and said reliability information of the blocks into a hash value
having reliable hash bits for which the extracted feature differs substantially from said
threshold, and unreliable bits for which the extracted feature does not differ substantially
from said threshold.

13. An arrangement for generating a hash signal identifying an information signal
in accordance with the method as claimed in any one of claims 1 to 12.

14. A method of matching an input block of hash words representing at least a part
of an information signal with hash signals identifying respective information signals stored in
a database, the method comprising the steps of:
(a) selecting a hash word of said input block of hash words;
(b) searching said hash word in the database;
(c) calculating a difference between the input block of hash words and a stored block of
hash words in which the hash word found in step (b) has the same position as the
selected hash word in the input block;
(d) repeating steps (a) to (c) for a further selected hash word until said difference is lower
than a predetermined threshold.

15. The method of claim 14, wherein the further selected hash word is another
hash word of the input block of hash words.

16. The method of claim 14, wherein the further selected hash word is obtained by
reversing a bit of the previously selected hash word.

17. The method of claim 16, further comprising the steps of receiving information
which is indicative of the reliability of the bits of the selected hash word, and using said
information to determine the bit to be reversed.

18. A method of matching a hash value representing an unidentified information
signal with a plurality of hash values stored in a database and identifying a respective
plurality of information signals, the method comprising the steps of:
(a) receiving said hash value in the form of a plurality of reliable hash bits and unreliable
hash bits;
(b) searching in the database the stored hash values for which holds that the reliable bits of
the applied hash value match the corresponding bits of the stored hash value;
(c) for each stored hash value found in step (b), calculating the bit error rate between the
reliable bits of the hash value representing the unidentified information signal and the
corresponding bits of the stored hash value; and
(d) determining for which stored hash values the bit error rate is minimal and sufficiently
small.

19. A method of matching a hash signal representing an unidentified information
signal with a plurality of hash signals stored in a database and identifying a respective
plurality of information signals, the method comprising the steps of:
(a) receiving said hash signal in the form of a series of hash values, each hash value having
reliable hash bits and unreliable hash bits;
(b) supplying one of the hash values of said series to the database;
(c) searching in the database the stored hash values for which holds that the reliable bits of
the applied hash value match the corresponding bits of the stored hash value;
(d) for each stored hash value found in step (c):
- selecting in the database the corresponding series of stored hash values;
- calculating the bit error rate between the reliable bits of the series of hash values
representing the unidentified information signal and the corresponding bits of the
selected series of hash values in the database; and
(f) determining for which series of stored hash values the bit error rate is minimal and
sufficiently small.

20. The method as claimed in claim 19, further comprising the steps of repeating
steps (b)-(f) for other hash values of the unidentified information signal until a series of
stored hash values is found for which the bit error rate is minimal and sufficiently small.

21. An arrangement for matching an input block of hash words representing at
least a part of an information signal with hash signals identifying respective information
signals stored in a database in accordance with the method as claimed in any one of claims 14
to 20.

22. A method of redirecting a receiver of an information signal to an Internet
website, the method comprising the steps of deriving a hash signal from said information
signal, and matching said hash signal with hash signals identifying Internet websites stored in
a database.

23. A method of measuring the quality of an information signal, the method
comprising the steps of deriving a hash signal from said information signal, matching said
hash signal with a hash signal identifying said information signal stored in a database, and
calculating the difference between the derived hash signal and the stored hash signal.

24. A method of identifying a multimedia signal, the method comprising the steps
of receiving and/or recording at least a part of said multimedia signal, deriving a hash signal
from said multimedia signal, sending said hash signal to a database for matching it with hash
signals stored in said database, and receiving from said database an identifier of the
multimedia signal.

25. The method of claim 24, wherein said steps of receiving and/or recording the
multimedia signal, deriving and sending the hash signal, and receiving the identifier are
performed by a mobile telephone device.